Notes for Physics 312 class readings

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Diffusion models, like large language models, are a type of neural network, but they serve a completely different purpose: the public knows them best as AI image generators. I personally first encountered OpenAI’s image generation system DALL-E 2, a model that generates images from text prompts.
Look at “A Shiba Inu dog wearing a beret and black turtleneck”, which I generated using DALL-E 2 around late 2022.
We’ll first look at a general overview of ideas shown in Anil Ananthaswamy’s post in Quanta Magazine: “The Physics Principle That Inspired Modern AI Art” [1].
The goal of generative models is to learn the probability distribution of a set of images and generate new data points that follow the original distribution.
Earlier models that produce realistic images include generative adversarial networks (GANs). However, they are hard to train.
The post motivates how generating image data works.
The plot below shows a visual example: three 2-pixel images, plotted as points in 2D space (one axis per pixel).
sns.set(font_scale = 1.5, style = "white")
seed = 72596
np.random.seed(seed)
im_array = np.rint((np.random.random((3,2))*255))
fig, ax = plt.subplots(1,2, figsize = (10, 5))
ax[0].matshow(im_array, cmap = 'gray')
ax[0].set_yticks([0,1, 2], ["First image", "Second image", "Third image"]);
ax[0].set_xticks([0, 1], ["Pixel 1", "Pixel 2"]);
ax[1].plot(im_array[:,0], im_array[:,1], ls = 'None', marker = '.')
ax[1].set_xlim(0, 255)
ax[1].set_ylim(0, 255)
ax[1].set_aspect(1) #create plot with equal aspect ratio
ax[1].set_title("Three 2-pixel images in 2D space")
sns.set(font_scale = 1.5, style = "white")
seed = 72596
np.random.seed(seed)
im_array = np.rint((np.random.random((2000,2))*255))
fig, ax = plt.subplots(1,2, figsize = (10, 5))
ax[0].plot(im_array[:,0], im_array[:,1], ls = 'None', marker = '.')
ax[0].set_xlim(0, 255)
ax[0].set_ylim(0, 255)
ax[0].set_title("2000 2-pixel images in 2D space")
sns.histplot(x = im_array[:,0], y = im_array[:,1], ax = ax[1], bins = 50)
ax[1].set_xlim(0, 255)
ax[1].set_ylim(0, 255)
ax[1].set_title("2D histogram, 50 bins")
The example above uses randomly generated 2-pixel images, so the peaks in the histogram are also random. Real image data would show meaningful structure instead.
This probability distribution can be used to generate new images: the generated images should follow the empirical joint distribution of pixels 1 and 2.
Extending this to bigger images, the dimensionality of the problem grows quickly, since each pixel is a dimension. Sampling a value for each dimension and laying the values out together recreates an image.
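To make the sampling idea concrete, here is a minimal sketch (the bin count and sample sizes are illustrative choices, not from the post): estimate the 2D distribution of 2-pixel images with a histogram, then draw new “images” from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "training set": 2000 two-pixel images with values in 0-255.
data = rng.integers(0, 256, size=(2000, 2))

# Estimate the empirical distribution with a 2D histogram.
counts, xedges, yedges = np.histogram2d(data[:, 0], data[:, 1], bins=50)
probs = counts.flatten() / counts.sum()

# "Generate" new images: pick bins according to their probability,
# then draw a uniform point inside each chosen bin.
idx = rng.choice(probs.size, size=5, p=probs)
rows, cols = np.unravel_index(idx, counts.shape)
new_images = np.column_stack([
    rng.uniform(xedges[rows], xedges[rows + 1]),
    rng.uniform(yedges[cols], yedges[cols + 1]),
])
print(new_images.shape)  # five new 2-pixel "images"
```

For real images this direct histogram approach is hopeless (the number of bins explodes with dimension), which is exactly why learned generative models are needed.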
GANs are hard to train because they sometimes do not learn the full probability distribution for a set of images and can only generate a subset, a failure known as mode collapse (the example given: a model trained on different animals sometimes only generates pictures of dogs).
Diffusion models are inspired by nonequilibrium thermodynamics, which describes how the probability distribution of a diffusing system evolves.
In the post, the example used is a drop of ink diffusing in a container.
Initially, the blue ink is localized in an area. To calculate the probability of finding an ink molecule in the container, a probability distribution that models the initial state is needed. (This kind of distribution is hard to sample from.)
After diffusing through the water, the ink molecules become more uniformly distributed, and this final state is much easier to express mathematically.
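The spreading can be illustrated with a toy 1D random walk (the molecule count, step sizes, and step count here are illustrative assumptions, not from the article): molecules start localized, and repeated independent random kicks spread them out.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Ink drop": 10,000 molecules all starting near the center (1D sketch).
positions = rng.normal(loc=0.0, scale=0.1, size=10_000)

# Each diffusion step adds an independent random kick (Brownian motion).
for _ in range(1000):
    positions += rng.normal(scale=0.05, size=positions.size)

# The spread (standard deviation) grows roughly as sqrt(number of steps),
# so the initially localized drop ends up widely dispersed.
print(positions.std())
```

After enough steps the distribution of positions is a broad Gaussian, which is the kind of simple final state the article refers to.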
The algorithm for generative modeling is first taught how to turn images into noise; generating images then amounts to learning to reverse this process.
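A sketch of this forward noising process on a toy “image” (the step count and the single fixed noise level are illustrative simplifications; real DDPMs use a schedule of many small, varying steps):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "image": a 1D array of pixel values scaled to [-1, 1].
x = rng.uniform(-1, 1, size=64)

# Forward diffusion: at each step, shrink the signal slightly and add
# Gaussian noise (the variance-preserving form used by DDPMs).
beta = 0.02  # per-step noise variance (illustrative choice)
x_t = x.copy()
for _ in range(500):
    x_t = np.sqrt(1 - beta) * x_t + np.sqrt(beta) * rng.normal(size=x_t.size)

# After many steps the signal is essentially gone: x_t is close to
# standard Gaussian noise, nearly independent of the original image.
print(abs(np.corrcoef(x, x_t)[0, 1]))
```

The model is then trained to undo one of these small noising steps at a time, which is what makes generation from pure noise possible.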
This diffusion model was initially published by Jascha Sohl-Dickstein [2].
Main issues: the generated images were worse than GAN outputs, and the sampling process was slow.
Instead of estimating the probability distribution of the data itself, Yang Song estimated the gradient of the distribution. Song worked on this without knowing about Sohl-Dickstein’s work.
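Song’s gradient idea can be illustrated with Langevin dynamics on a 1D Gaussian, where the gradient of the log-density is known in closed form (the parameters and step counts below are arbitrary illustrative choices, not from Song’s papers):

```python
import numpy as np

rng = np.random.default_rng(3)

# For a 1D Gaussian N(mu, sigma^2), the gradient of the log-density
# (the "score") has a closed form: d/dx log p(x) = -(x - mu) / sigma^2.
mu, sigma = 3.0, 0.5
score = lambda x: -(x - mu) / sigma**2

# Langevin dynamics: repeatedly follow the score plus a small random kick.
# Samples end up distributed according to p, using only its gradient.
x = rng.normal(size=5000)  # start from arbitrary noise
step = 0.01
for _ in range(2000):
    x = x + step * score(x) + np.sqrt(2 * step) * rng.normal(size=x.size)

# The sample mean and spread approach mu and sigma.
print(x.mean(), x.std())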
Denoising diffusion probabilistic models (DDPMs) updated Sohl-Dickstein’s diffusion model with Song’s ideas.
DDPMs matched or surpassed other generative models, including GANs (the benchmark compared the distribution of generated images to the training set).
More recent models all use a variation of DDPM.
Other models incorporate text during training, allowing image generation from text prompts.
Models are prone to biases in their training datasets.
There are also ethical issues around scraped training data (copyright, etc.).
Latent diffusion models (LDMs) were created to reduce the computational complexity of training and sampling without sacrificing the performance of diffusion models [5]. Instead of working in pixel space, latent diffusion models are “made to learn a space that is perceptually equivalent to the image space¹, but offers reduced computational complexity” (this space is called the latent space) [5].
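As a rough illustration of why a much lower-dimensional space can still be nearly lossless, here is a sketch using PCA as a stand-in for the learned encoder (LDMs actually use a trained autoencoder, and all dimensions below are made up):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy "images": 500 points in a 100-dimensional pixel space that actually
# lie near a 5-dimensional subspace (real images are similarly redundant).
basis = rng.normal(size=(5, 100))
codes = rng.normal(size=(500, 5))
images = codes @ basis + 0.01 * rng.normal(size=(500, 100))

# PCA as a stand-in for a learned encoder: project to a small latent space.
centered = images - images.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
latents = centered @ vt[:5].T        # encode: 100 dims -> 5 dims
reconstruction = latents @ vt[:5]    # decode back to pixel space

err = np.abs(reconstruction - centered).mean()
print(err)  # tiny, despite the 20x smaller representation
```

Running the diffusion process on the 5-dimensional latents instead of the 100-dimensional pixels is what buys LDMs their reduced computational cost.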
Stable Diffusion (SD) builds upon the work done by Rombach et al. [5] and is a type of text-to-image LDM. Unlike other text-to-image models (like DALL-E 2), SD is open source and is trained on images from LAION, a non-profit that makes open-source AI models and datasets. Aside from generating images from text prompts, SD can be used for image modification (usually referred to as img2img), taking an image as input along with a prompt to guide the generation.
SD has an available demo for testing the outputs of SD 2.1. Here’s an example of images generated with the prompt “Paris in a rainy day”.
Figure 2: “Paris in a rainy day”, with and without umbrella as a negative prompt.
In Figure 2 (a), two of the generated images show the Eiffel Tower, a famous landmark in Paris. Most of the images appear grayish and overcast, with some people walking with umbrellas. When umbrella is used as a negative prompt, no umbrellas appear in the scene, and interestingly, the scene looks more colorful (even though the ground appears flooded).
Since no seed is set, generations with the demo are probabilistic, and checking how much the generated images are affected by the negative prompt would require generating many images. Based on this blog post, negative prompts seem to have more impact on newer SD models than on earlier ones.
Regardless, the fact that it’s now possible to start from noise and generate an image that matches the prompt is still mind-boggling.
AI image generation is developing at a very rapid pace. That said, generative models still face issues related to the datasets they are trained on.
LAION, the main source of the dataset SD is trained on, maintains a collection of links to scraped images with alt-text as captions. While these images may be publicly available on the Internet, the copyright of the images used may still be infringed. Beyond copyright concerns, someone found private medical record images of themselves in the LAION-5B database.
Currently, Stability AI is being sued by a group of artists and by Getty Images for copyright infringement. While text-to-image generation will not replace artists creating art, artists with sought-after styles (like Greg Rutkowski) can see their names flooded with AI-generated images they did not create. There is currently no way to know whether your artwork was included in the training of a diffusion model, aside from checking whether it appears in the training data.
Some recent developments aim to “disrupt style mimicry” by diffusion models to protect artists. Glaze lets artists apply a “style cloak” to their art before posting it online. The cloaks are image-specific and let the artist choose a target style: fine-tuning a model on cloaked images produces generations that match the target style instead of the artist’s own. However, given the rapid pace of development in the field, it is unknown how long such protections will hold before being circumvented.
Since diffusion models generate data that follow the distribution of the training set, any biases in the data can be reflected in the generated data. The end of the Quanta Magazine article [1] discusses an avatar generator that produced sexualized images for women but not for men.
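A toy sketch of how faithfully sampling a learned distribution reproduces dataset imbalance (the categories and proportions below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical training set where one category dominates 90/10.
labels = rng.choice(["dog", "cat"], size=10_000, p=[0.9, 0.1])

# A generative model that matches the training distribution will
# reproduce this imbalance in its outputs.
values, counts = np.unique(labels, return_counts=True)
learned_p = counts / counts.sum()
generated = rng.choice(values, size=1000, p=learned_p)

dog_fraction = (generated == "dog").mean()
print(dog_fraction)  # close to the 0.9 skew in the training data
```

Nothing in the sampling step corrects the skew; fixing it requires curating the dataset or intervening in the model itself.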
Diffusion models take a lot of inspiration from nonequilibrium thermodynamics.
The main goal of diffusion models is to create a sample that closely follows a modeled data distribution. A diffusion model achieves this by gradually adding noise to a distribution and learning how to remove this noise (or recreate a sample from the noise).
Latent diffusion models perform the diffusion in a latent space (like an image embedding), which has a lower dimension than the pixel space.
Many of the problems arising from the rapid advancement of diffusion models relate to the training data: the ethics behind the use of copyrighted data and possible harmful biases from the dataset.
This sounds like what word embeddings are trying to do in LLMs, but for images.↩︎